Abstract It has been shown in literature that the Lasso estimator, or
$\ell_1$-penalized least squares estimator, enjoys good oracle properties. This
paper examines which special properties of the $\ell_1$-penalty allow for sharp
oracle results, and then extends the situation to general norm-based penalties
that satisfy a weak decomposability condition.
1 Introduction



The Lasso (Tibshirani [1996]) has become extremely popular in the last several
years. It is a computationally tractable method for high-dimensional models,
with good theoretical properties. Several types of modifications of the Lasso
have been introduced and studied, such as the fused Lasso (Tibshirani et al.
[2005]) and the smoothed Lasso (Hebiri and van de Geer [2011]). In this paper
we are primarily interested in extensions of the $\ell_1$-penalty to general
structured sparsity penalties such as the group Lasso introduced by Yuan and
Lin [2006] and further structured versions given by Zhao et al. [2009], Jacob
et al. [2009], Jenatton et al. [2011] and Micchelli et al. [2010]. We will
provide sharp versions of the oracle inequalities given in Bach [2010] and
extend the sharp oracle results for the Lasso and nuclear norm penalization as
given in Koltchinskii et al. [2011] and Koltchinskii [2011] to general
structured sparsity penalties, where we in addition prove inequalities for the
estimation error.

Consider the linear model

$$Y = X\beta^0 + \epsilon,$$

where $Y$ is an $n$-vector of observations, $X$ is a given $n \times p$ matrix,
$\epsilon$ is an $n$-vector of errors and $\beta^0$ is a $p$-vector of unknown
coefficients. The Lasso estimator is

$$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + 2\lambda \|\beta\|_1 \right\}.$$

Here, $\|\beta\|_1 := \sum_{j=1}^p |\beta_j|$ denotes the $\ell_1$-norm of the
vector $\beta$ and for a vector $v \in \mathbb{R}^n$ we let $\|v\|_n$ be the
normalized Euclidean norm $\|v\|_n := \sqrt{v^T v / n}$. Finally $\lambda > 0$
is a tuning parameter. The $\ell_1$-penalty is a variable selection or
(soft-)thresholding type penalty: the larger $\lambda$, the more coefficients
$\hat\beta_j$ will be set to zero.
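
The thresholding behaviour can be illustrated with a minimal sketch. It assumes an orthonormal design ($X^T X / n = I$), which is not required in the paper but makes the Lasso solution an explicit componentwise soft-thresholding of $X^T Y / n$. The function names, the simulated data and the chosen values of $\lambda$ are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_orthonormal(X, Y, lam):
    """Lasso solution under the assumption X^T X / n = I.

    Minimizes ||Y - X b||_n^2 + 2 * lam * ||b||_1 with ||v||_n^2 = v^T v / n.
    """
    n = X.shape[0]
    z = X.T @ Y / n              # least squares solution for orthonormal design
    return soft_threshold(z, lam)

# Simulated data (assumption of this sketch): orthonormal design, sparse beta0.
rng = np.random.default_rng(0)
n, p = 50, 5
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q               # columns satisfy X^T X / n = I
beta0 = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
Y = X @ beta0 + 0.1 * rng.standard_normal(n)

for lam in (0.1, 0.75, 1.5, 3.0):
    b = lasso_orthonormal(X, Y, lam)
    print(lam, int(np.count_nonzero(b)))   # number of selected coefficients
```

For a general design no closed form is available and an iterative solver is needed; the sketch only shows why a larger tuning parameter sets more coefficients to zero.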



In this paper, we first briefly review a sharp oracle result of Koltchinskii et
al. [2011] for the Lasso estimator. We then extend the sharp oracle result to
other norm-penalties, satisfying a weak decomposability condition as given in
Section 4.

The paper is organized as follows. We first introduce the concept "effective
sparsity" in Section 2. Effective sparsity plays a crucial role in all our
results. As a benchmark, we then restate in Section 3 an oracle inequality from
Koltchinskii et al. [2011] for the Lasso. Theorem 4.1 in Section 4 contains the
main result. It extends the $\ell_1$-norm penalty to general weakly
decomposable norm-penalties. Some examples are given in Section 5. In Section 6
we consider comparison of the effective sparsity based on the $\ell_1$-norm to
the effective sparsity based on a different norm. A brief discussion of the
results and further research is given in Section 7. Finally, Section 8 contains
the proofs.



2 The $\ell_1$-eigenvalue and effective sparsity for the $\ell_1$-norm

To state an oracle result, we need to define the $\ell_1$-eigenvalue
$\delta(L,S)$, where $L > 0$ is a constant and $S \subset \{1, \ldots, p\}$ is
an index set. We use the notation

$$\beta_{j,S} := \beta_j 1\{j \in S\}, \quad j = 1, \ldots, p.$$

Thus $\beta_S$ is a $p$-vector with zero entries at the indexes $j \notin S$.
We will sometimes identify $\beta_S$ with the vector
$\{\beta_j\}_{j \in S} \in \mathbb{R}^{|S|}$.

Definition 2.1 For constant $L > 0$ and an index set $S$, the
$\ell_1$-eigenvalue is

$$\delta(L,S) := \min\left\{ \|X\beta_S - X\beta_{S^c}\|_n : \|\beta_S\|_1 = 1,\ \|\beta_{S^c}\|_1 \le L \right\}.$$

The compatibility constant is

$$\phi^2(L,S) := |S|\,\delta^2(L,S).$$



The geometric interpretation of the $\ell_1$-eigenvalue, as given in van de
Geer and Lederer [2012], is as follows. Let $X_j \in \mathbb{R}^n$ denote the
$j$-th column of $X$ ($j = 1, \ldots, p$). The set
$\{X\beta_S : \|\beta_S\|_1 = 1\}$ is the convex hull of the vectors
$\{\pm X_j\}_{j \in S}$ in $\mathbb{R}^n$. Likewise, the set
$\{X\beta_{S^c} : \|\beta_{S^c}\|_1 \le L\}$ is the convex hull including
interior of the vectors $\{\pm L X_j\}_{j \in S^c}$. Thus, the
$\ell_1$-eigenvalue $\delta(L,S)$ is the distance between these two sets. We
note that:

- if $L$ is large the $\ell_1$-eigenvalue will be small,

- it will also be small if the vectors in $S$ exhibit strong correlation with
those in $S^c$,

- when the vectors in $\{X_j\}_{j \in S}$ are linearly dependent, it holds that

$$\{X\beta_S : \|\beta_S\|_1 = 1\} = \{X\beta_S : \|\beta_S\|_1 \le 1\},$$

and hence then $\delta(L,S) = 0$.

The compatibility constant was introduced in van de Geer [2007]. Its name
comes from the idea that when $\phi(L,S)$ is large the normalized Euclidean
norm $\|\cdot\|_n$ and the $\ell_1$-norm $\|\cdot\|_1$ are in some sense
compatible. The difference between the compatibility constant and the squared
$\ell_1$-eigenvalue lies only in the normalization by the size $|S|$ of the set
$S$. This normalization is inspired by the orthogonal case, which we detail in
the following example.
Example 2.1 Suppose that the columns of $X$ are all orthogonal:
$X_j^T X_k = 0$ for all $j \ne k$. Assume moreover the normalization
$\|X_j\|_n = 1$ for all $j$. Then clearly,

$$\|X\beta_S - X\beta_{S^c}\|_n = \|\beta_S - \beta_{S^c}\|_2,$$

where $\|\beta\|_2 := \sqrt{\sum_{j=1}^p \beta_j^2}$ is the $\ell_2$-norm of
the vector $\beta$. But

$$\|\beta_S - \beta_{S^c}\|_2^2 = \|\beta_S\|_2^2 + \|\beta_{S^c}\|_2^2 \ge \|\beta_S\|_2^2 \ge \|\beta_S\|_1^2 / |S|,$$

and in fact

$$\min_{\|\beta_{S^c}\|_1 \le L,\ \|\beta_S\|_1 = 1} \|\beta_S - \beta_{S^c}\|_2^2 = \min_{\|\beta_S\|_1 = 1} \|\beta_S\|_2^2 = 1/|S|.$$

It follows that $\delta^2(L,S) = 1/|S|$ and $\phi^2(L,S) = 1$.

A vector $\beta$ is called sparse if it has only few non-zero coefficients.
That is, the cardinality $|S_\beta|$ of the set
$S_\beta := \{j : \beta_j \ne 0\}$ is small. We call $|S_\beta|$ the
sparsity-index of $\beta$. More generally, we call $|S|$ the sparsity index of
the set $S$. The effective sparsity, as defined in van de Geer and Müller
[2012], takes into account the correlation structure in the design matrix $X$.

Definition 2.2 For a set $S$ and constant $L > 0$, the effective sparsity
$\Gamma^2(L,S)$ is the inverse of the squared $\ell_1$-eigenvalue, that is

$$\Gamma^2(L,S) := \frac{1}{\delta^2(L,S)}.$$

In other words, for orthogonal design the effective sparsity of a set $S$ is
its cardinality, and in general, it is the inverse of the squared distance
between the convex hull $\{X\beta_S : \|\beta_S\|_1 = 1\}$ and the convex set
$\{X\beta_{S^c} : \|\beta_{S^c}\|_1 \le L\}$.

Finally, we give a small numerical example from van de Geer and Müller [2012].

Example 2.2 As a simple numerical example, let us suppose $n = 2$, $p = 3$,
$S = \{3\}$, and

$$X = \sqrt{n} \begin{pmatrix} 5/13 & 0 & 1 \\ 12/13 & 1 & 0 \end{pmatrix}.$$

Since the sparsity index is $|S| = 1$, the $\ell_1$-eigenvalue $\delta(L,S)$ is
equal to the square root $\phi(L,S)$ of the compatibility constant, and equal
to the distance of $X_3$ to the line that connects $LX_1$ and $-LX_2$, that is

$$\delta(L,S) = \max\{(5 - L)/\sqrt{26},\ 0\}.$$

Hence, for example for $L = 3$ the effective sparsity is
$\Gamma^2(3,S) = 13/2$. Alternatively, when

$$X = \sqrt{n} \begin{pmatrix} 12/13 & 0 & 1 \\ 5/13 & 1 & 0 \end{pmatrix},$$

then for example $\delta(3,S) = 0$ and hence $\Gamma^2(3,S) = \infty$. This is
due to the sharper angle between $X_1$ and $X_3$.
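
The numbers in this example can be checked with a short sketch. It relies on the geometric interpretation above: with $n = 2$ and $S = \{3\}$, the eigenvalue is the Euclidean distance from the third column to the parallelogram spanned by $\pm L$ times the first two columns (after dividing the design by $\sqrt{n}$). The two matrices below are assumed readings of the designs in the example, with first columns $(5/13, 12/13)$ and $(12/13, 5/13)$; all function names are illustrative.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    d = b - a
    t = np.clip(np.dot(p - a, d) / np.dot(d, d), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * d)))

def l1_eigenvalue_example(M, L):
    """delta(L, S) for n = 2, p = 3, S = {3} and X = sqrt(n) * M.

    Since ||X b||_n = ||M b||_2, we work with M directly. The set
    {M b_{S^c} : ||b_{S^c}||_1 <= L} is the parallelogram with vertices
    +-L*M[:, 0] and +-L*M[:, 1]; {M b_S : ||b_S||_1 = 1} is {+-M[:, 2]}.
    """
    x1, x2, x3 = M[:, 0], M[:, 1], M[:, 2]
    # x3 lies inside the parallelogram iff its coordinates in the basis
    # (x1, x2) have l1-norm at most L.
    coords = np.linalg.solve(M[:, :2], x3)
    if np.abs(coords).sum() <= L:
        return 0.0
    # Otherwise the distance is attained on one of the four edges.
    edges = [(s1 * L * x1, s2 * L * x2) for s1 in (1, -1) for s2 in (1, -1)]
    return min(point_segment_distance(x3, a, b) for a, b in edges)

# Assumed readings of the two designs in the example (without the sqrt(n) factor).
M1 = np.array([[5 / 13, 0.0, 1.0],
               [12 / 13, 1.0, 0.0]])
M2 = np.array([[12 / 13, 0.0, 1.0],
               [5 / 13, 1.0, 0.0]])

delta1 = l1_eigenvalue_example(M1, 3.0)
delta2 = l1_eigenvalue_example(M2, 3.0)
print(delta1, (5 - 3) / np.sqrt(26))   # numerical value vs. closed form
print(1 / delta1**2)                   # effective sparsity for the first design
print(delta2)                          # second design: the sets intersect
```

By the symmetry of the parallelogram, the distance from $-X_3$ is the same, so only one of the two points needs to be checked.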
3 An oracle inequality for the $\ell_1$-norm

For a vector $w \in \mathbb{R}^p$ we let
$\|w\|_\infty := \max_{1 \le j \le p} |w_j|$ be the uniform norm. The
following theorem is a slight extension of Koltchinskii et al. [2011] (we use
the effective sparsity instead of restricted eigenvalues). The sparsity oracle
inequality in this theorem is a simple consequence of the following properties
of the $\ell_1$-norm:

• Dual norm equality: $\sup\{|w^T\beta| : \|\beta\|_1 \le 1\} = \|w\|_\infty$, $\forall\, w$,

• Triangle inequality: $\|\beta + \tilde\beta\|_1 \le \|\beta\|_1 + \|\tilde\beta\|_1$, $\forall\, \beta, \tilde\beta$,

• Decomposability: $\|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{S^c}\|_1$, $\forall\, \beta, S$.

Note that the triangle inequality implies convexity:
$\|\alpha\beta + (1-\alpha)\tilde\beta\|_1 \le \alpha\|\beta\|_1 + (1-\alpha)\|\tilde\beta\|_1$,
$\forall\, \beta, \tilde\beta$ and all $0 \le \alpha \le 1$. Convexity of the
penalty is crucial for deriving oracle inequalities that are sharp. Lemma 8.1
gives the details.

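Recall the notation

$$S_\beta := \{j : \beta_j \ne 0\}, \quad \beta \in \mathbb{R}^p.$$

Theorem 3.1 (Koltchinskii et al. [2011]) Let for $S \subset \{1, \ldots, p\}$

$$\lambda^S := \|(\epsilon^T X)_S\|_\infty / n, \quad \lambda^{S^c} := \|(\epsilon^T X)_{S^c}\|_\infty / n.$$

Define for $\lambda > \lambda^{S^c}$

$$L_S := \frac{\lambda + \lambda^S}{\lambda - \lambda^{S^c}}.$$

Then

$$\|X(\hat\beta - \beta^0)\|_n^2 \le \min_{\beta \in \mathbb{R}^p,\ S = S_\beta,\ \lambda > \lambda^{S^c}} \left\{ \|X(\beta - \beta^0)\|_n^2 + (\lambda + \lambda^S)^2\, \Gamma^2(L_S, S) \right\}.$$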



Thus, the Lasso trades off an approximation error
$\|X(\beta - \beta^0)\|_n^2$ with an estimation error
$(\lambda + \lambda^S)^2 \Gamma^2(L_S, S_\beta)$. The above oracle inequality
is called sharp because the constant in front of the approximation error
$\|X(\beta - \beta^0)\|_n^2$ is one. Apart from Koltchinskii et al. [2011] and
Koltchinskii [2011], results in literature are mostly non-sharp versions, with
a constant larger than one in front of the approximation error, see e.g.
Bühlmann and van de Geer [2011]. It is interesting to note that convexity of
the penalty plays a crucial role, e.g., with the $\ell_0$-penalty one cannot
arrive at sharp oracle results. Observe that we do not present a bound for the
$\ell_1$-error in Theorem 3.1. We will show how such a bound can be included
in the results in Theorem 4.1.

Remark 3.1 It is as yet not clear to what extent $\ell_1$-eigenvalue
conditions are necessary for oracle behavior of the prediction error
$\|X(\hat\beta - \beta^0)\|_n^2$ of the Lasso estimator. For example, if the
design matrix $X$ has repeated columns (or columns that are proportional) in
the set $S$, then the $\ell_1$-eigenvalue will be zero. A reparametrization
argument shows however that the Lasso estimator behaves as if repeated columns
are treated as one.
4 A sharp oracle inequality for general weakly decomposable penalties

Let $\Omega$ be some norm on $\mathbb{R}^p$, and let $\hat\beta$ be the
norm-penalized estimator

$$\hat\beta := \hat\beta_\Omega := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + 2\lambda\Omega(\beta) \right\}.$$

We will derive an oracle inequality for $\hat\beta$ for weakly decomposable
norms $\Omega$, a notion introduced in Definition 4.1.

Recall that the $\ell_1$-norm is decomposable:
$\|\beta\|_1 = \|\beta_S\|_1 + \|\beta_{S^c}\|_1$ for all vectors $\beta$ and
any set $S$. The triangle inequality of course holds for any norm $\Omega$ and
so does the dual norm equality with the uniform norm replaced by the dual norm

$$\Omega_*(w) := \sup_{\Omega(\beta) \le 1} |w^T\beta|.$$

We stress that the triangle inequality and dual norm equality fail to hold if
we replace the norm by powers of that norm. For example, the triangle
inequality does not hold for $\|\cdot\|_2^2$, which will mean the ridge
regression penalty does not fall within our framework. Returning to a general
norm $\Omega$, it is not necessarily decomposable. Decomposability is however
very useful for the derivation of oracle inequalities, an observation which
was discussed previously by van de Geer [2001], van de Geer [2010] (where the
property is called separability) and Negahban et al. [2012]. Note that powers
of norms can be decomposable, for example
$\|\beta\|_2^2 = \|\beta_S\|_2^2 + \|\beta_{S^c}\|_2^2$. However, the required
triangle inequality does not hold for $\|\cdot\|_2^2$.
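
A two-line numerical check makes the contrast concrete. The sketch below uses an arbitrary illustrative vector and index set (both assumptions) and shows that the squared $\ell_2$-norm is decomposable but violates the triangle inequality, whereas the $\ell_1$-norm satisfies both properties.

```python
import numpy as np

def sq_l2(b):
    """Squared l2-norm, i.e. the ridge penalty (not a norm)."""
    return float(np.dot(b, b))

def l1(b):
    """l1-norm."""
    return float(np.abs(b).sum())

beta = np.array([1.0, -2.0, 0.5, 3.0])           # illustrative vector
S = np.array([True, True, False, False])         # index set S as a boolean mask
beta_S = np.where(S, beta, 0.0)
beta_Sc = np.where(S, 0.0, beta)

# Decomposability holds for the squared l2-norm ...
sq_decomposable = bool(np.isclose(sq_l2(beta), sq_l2(beta_S) + sq_l2(beta_Sc)))
# ... but the triangle inequality fails, e.g. for beta + beta.
sq_triangle = sq_l2(beta + beta) <= sq_l2(beta) + sq_l2(beta)

# The l1-norm satisfies both properties.
l1_decomposable = bool(np.isclose(l1(beta), l1(beta_S) + l1(beta_Sc)))
l1_triangle = l1(beta + beta) <= l1(beta) + l1(beta) + 1e-12

print(sq_decomposable, sq_triangle, l1_decomposable, l1_triangle)
```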

We will show now that decomposability is not a necessary condition for oracle
results. This was also realized by Bach [2010], although there the situation
is restricted to structured sparsity norms defined by sub-modular functions.
We consider general norms $\Omega$, which are perhaps not decomposable, but
only weakly decomposable for certain index sets $S$, which means that the norm
$\Omega(\beta)$ of an arbitrary vector $\beta$ is always superior to the sum
of norms of $\beta_S$ and $\beta_{S^c}$.

Definition 4.1 Fix some set $S$. We say that the norm $\Omega$ is weakly
decomposable if there exists a norm $\Omega^{S^c}$ on $\mathbb{R}^{p-|S|}$
such that for all $\beta \in \mathbb{R}^p$,

$$\Omega(\beta) \ge \Omega(\beta_S) + \Omega^{S^c}(\beta_{S^c}).$$

Definition 4.2 We say that $S$ is an allowed set (for $\Omega$) if $\Omega$ is
weakly decomposable for $S$.

The best choice for $\Omega^{S^c}$ is to take $\Omega^{S^c}(\beta_{S^c})$ as
large as possible (see also Section 7). We identify $\beta_{S^c}$ with the
$(p - |S|)$-vector $\{\beta_j\}_{j \in S^c}$ and consider $\Omega^{S^c}$ as
norm on $\mathbb{R}^{p-|S|}$ instead of $\mathbb{R}^p$. There may be no
"natural" extension to a norm on $\mathbb{R}^p$ (see Section 5.3 for an
illustration), and an extension is also not needed.

Observe that any norm is trivially (weakly) decomposable for the empty set
and for the complete set $\{1, \ldots, p\}$ containing the indices of all the
variables. Some examples, where we in particular discuss nontrivial choices of
$S$, will be given in Section 5.

We also extend the definition of $\ell_1$-eigenvalues and effective sparsity
to general weakly decomposable norms.

Definition 4.3 Suppose $S$ is an allowed set. Let $L > 0$ be some constant.
The $\Omega$-eigenvalue (for $S$) is

$$\delta_\Omega(L,S) := \min\left\{ \|X\beta_S - X\beta_{S^c}\|_n : \Omega(\beta_S) = 1,\ \Omega^{S^c}(\beta_{S^c}) \le L \right\}.$$

The $\Omega$-effective sparsity is

$$\Gamma_\Omega^2(L,S) := \frac{1}{\delta_\Omega^2(L,S)}.$$

The $\Omega$-eigenvalue $\delta_\Omega(L,S)$ depends on the choice of the norm
$\Omega^{S^c}$, but we do not express this in our notation. It has a similar
geometric interpretation as the $\ell_1$-eigenvalue: $\delta_\Omega(L,S)$ is
the distance between the sets $\{X\beta_S : \Omega(\beta_S) = 1\}$ and
$\{X\beta_{S^c} : \Omega^{S^c}(\beta_{S^c}) \le L\}$. The shape of these sets
depends heavily on the norms $\Omega$ and $\Omega^{S^c}$.

We will use the effective sparsity to bound the norm of $\beta_S$ in terms of
$\|X\beta\|_n$, as detailed in the following lemma. Here we use the "cone
condition" for $\Omega$.

Definition 4.4 Let $L > 0$ be some constant, $S$ some allowed set and
$\beta \in \mathbb{R}^p$ some vector. We say that $\beta$ satisfies the
$(L,S)$-cone condition for $\Omega$ if
$\Omega^{S^c}(\beta_{S^c}) \le L\,\Omega(\beta_S)$.

Lemma 4.1 Suppose $S$ is an allowed set. Then

$$\delta_\Omega(L,S) = \min\left\{ \frac{\|X\beta\|_n}{\Omega(\beta_S)} : \beta \text{ satisfies the } (L,S)\text{-cone condition},\ \beta_S \ne 0 \right\}$$

and hence, for all $\beta$ that satisfy the $(L,S)$-cone condition,

$$\Omega(\beta_S) \le \Gamma_\Omega(L,S)\, \|X\beta\|_n.$$



The ingredients for an oracle inequality are now: 

• the dual-norm equality, 

• the triangle inequality, 

• weak decomposability. 

In other words, the situation is as for the Lasso, but the decomposability
property is weakened. The dual norm of $\Omega$ is denoted by $\Omega_*$, that
is

$$\Omega_*(w) := \sup_{\Omega(\beta) \le 1} |w^T\beta|, \quad w \in \mathbb{R}^p.$$

We moreover let $\Omega_*^{S^c}$ be the dual norm of $\Omega^{S^c}$.






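Theorem 4.1 Let $\beta \in \mathbb{R}^p$ be arbitrary and let
$S \supset \{j : \beta_j \ne 0\}$ be an allowed set. Define

$$\lambda^S := \Omega_*\big((\epsilon^T X)_S\big)/n, \quad \lambda^{S^c} := \Omega_*^{S^c}\big((\epsilon^T X)_{S^c}\big)/n.$$

Suppose

$$\lambda > \lambda^{S^c}.$$

Define for some $0 \le \delta < 1$

$$L_S := \frac{\lambda + \lambda^S}{\lambda - \lambda^{S^c}} \cdot \frac{1 + \delta}{1 - \delta}.$$

Then

$$\|X(\hat\beta - \beta^0)\|_n^2 + \delta(\lambda - \lambda^{S^c})\,\Omega^{S^c}(\hat\beta_{S^c}) + \delta(\lambda + \lambda^S)\,\Omega(\hat\beta_S - \beta)$$

$$\le \|X(\beta - \beta^0)\|_n^2 + \big[(1 + \delta)(\lambda + \lambda^S)\big]^2\, \Gamma_\Omega^2(L_S, S).$$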



Theorem 4.1 requires that $S \supset S_\beta$ is an allowed set. If, for
values of $\beta$ that one considers as good approximations of $\beta^0$, the
smallest allowed set $S \supset S_\beta$ is much larger than $S_\beta$, then
the penalty is simply not suited to describe the underlying sparsity
structure.

As a special case, one may take $\beta = \beta^0$ and $S_0$ the smallest
allowed set containing all non-zero $\beta_j^0$ ($j = 1, \ldots, p$). However,
the trade-off between approximation error $\|X(\beta - \beta^0)\|_n^2$ and
estimation error $(\lambda + \lambda^S)^2 \Gamma_\Omega^2(L_S, S)$ will give
better bounds. Theorem 4.1 is sharp as the constant in front of the
approximation error $\|X(\beta - \beta^0)\|_n^2$ is one. The choice
$\delta = 0$ is optimal if one only is interested in bounds for the prediction
error $\|X(\hat\beta - \beta^0)\|_n^2$.



5 Some examples 
5.1 The Lasso 

The $\ell_1$-norm $\Omega(\cdot) := \|\cdot\|_1$ is (weakly) decomposable for
all $S$, with $\Omega^{S^c} = \Omega$, and $\Omega_* = \|\cdot\|_\infty$.
Hence, for all $\beta$ the set $S_\beta$ is an allowed set, that is, we can
take $S = S_\beta$ in Theorem 4.1. The choice $\delta = 0$ then gives Theorem
3.1. For $\delta > 0$ however, we see that we also obtain a bound for
$\|\hat\beta - \beta\|_1$, and hence for the $\ell_1$-estimation error
$\|\hat\beta - \beta^0\|_1$. Here, one can use the triangle inequality
$\|\hat\beta - \beta^0\|_1 \le \|\hat\beta - \beta\|_1 + \|\beta - \beta^0\|_1$,
i.e., it again involves a trade-off.



5.2 Group Lasso 

Also the group Lasso norm $\|\cdot\|_{2,1}$ falls within the framework of
decomposable norms. Let $G_t \subset \{1, \ldots, p\}$ ($t = 1, \ldots, T$),
$\cup_{t=1}^T G_t = \{1, \ldots, p\}$, $G_s \cap G_t = \emptyset$ for
$s \ne t$, be a partition of $\{1, \ldots, p\}$ into disjoint groups. The norm
corresponding to the group Lasso penalty is

$$\Omega(\beta) := \|\beta\|_{2,1} := \sum_{t=1}^T \sqrt{|G_t|}\, \|\beta_{G_t}\|_2.$$

It is (weakly) decomposable for $S = \cup_{t \in \mathcal{T}} G_t$
($\mathcal{T}$ being any subset of $\{1, \ldots, T\}$), with
$\Omega^{S^c} = \Omega$. Thus, we can take
$S := \cup\{G_t : \|\beta_{G_t}\|_2 \ne 0\}$ as allowed set, that is, as soon
as $\beta_j \ne 0$ for some $j \in G_t$, we take the whole group of indexes
$G_t$ into our allowed set $S$. The dual norm is

$$\Omega_*(w) = \|w\|_{2,\infty} := \max_{1 \le t \le T} \|w_{G_t}\|_2 / \sqrt{|G_t|}, \quad w \in \mathbb{R}^p.$$



Let $X_{G_t} := \{X_j\}_{j \in G_t}$ be the $n \times |G_t|$ design matrix of
the variables in group $t$ ($t = 1, \ldots, T$). Suppose that within groups
the design is orthonormal, that is $X_{G_t}^T X_{G_t} / n = I$ for all $t$.
Then $\|X\beta_{G_t}\|_n = \|\beta_{G_t}\|_2$ and when
$\epsilon \sim \mathcal{N}(0, I)$, the random variables
$\|(\epsilon^T X)_{G_t}\|_2^2 / n$ have a $\chi^2$-distribution with $|G_t|$
degrees of freedom. Thus,

$$\|\epsilon^T X\|_{2,\infty}^2 / n$$

is the maximum of $T$ normalized $\chi^2$-random variables. Invoking
probability inequalities for such maxima, Theorem 4.1 then gives similar (but
sharp) oracle results as those in Lounici et al. [2011] or Bühlmann and van de
Geer [2011].
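
The group Lasso norm, its dual norm and the allowed set built from the active groups are straightforward to compute. The following sketch uses an arbitrary partition and test vectors (assumptions for illustration, with zero-based indices) and checks the dual norm inequality $|w^T\beta| \le \Omega(\beta)\,\Omega_*(w)$ as well as decomposability over a union of groups.

```python
import numpy as np

def group_lasso_norm(beta, groups):
    """Omega(beta) = sum_t sqrt(|G_t|) * ||beta_{G_t}||_2."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)

def group_lasso_dual_norm(w, groups):
    """Dual norm: max_t ||w_{G_t}||_2 / sqrt(|G_t|)."""
    return max(np.linalg.norm(w[g]) / np.sqrt(len(g)) for g in groups)

def allowed_set(beta, groups):
    """Union of all groups on which beta is not identically zero."""
    active = [g for g in groups if np.linalg.norm(beta[g]) != 0]
    return sorted(j for g in active for j in g)

groups = [[0, 1], [2, 3, 4], [5]]           # illustrative partition of {0,...,5}
beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0, -2.0])
w = np.random.default_rng(1).standard_normal(6)

S = allowed_set(beta, groups)               # whole groups enter the allowed set
mask = np.isin(np.arange(6), S)
omega = group_lasso_norm(beta, groups)
omega_S = group_lasso_norm(np.where(mask, beta, 0.0), groups)
omega_Sc = group_lasso_norm(np.where(mask, 0.0, beta), groups)

print(S)
print(abs(w @ beta) <= omega * group_lasso_dual_norm(w, groups))
print(np.isclose(omega, omega_S + omega_Sc))
```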



5.3 General structured sparsity 

The following example describes a general structured sparsity norm, as
introduced by Micchelli et al. [2010]. Let
$\mathcal{A} \subset [0, \infty)^p$ be some convex cone, satisfying
$\mathcal{A} \cap (0, \infty)^p \ne \emptyset$, and

$$\Omega(\beta) := \Omega(\beta; \mathcal{A}) := \min_{a \in \mathcal{A}} \frac{1}{2} \sum_{j=1}^p \left( \frac{\beta_j^2}{a_j} + a_j \right).$$

Here we use the convention $0/0 = 0$. The assumption
$\mathcal{A} \cap (0, \infty)^p \ne \emptyset$ says that there is an
$a \in \mathcal{A}$ with all entries positive, so that for all $\beta$,
$\Omega(\beta) < \infty$. It is shown in Micchelli et al. [2010] that $\Omega$
is indeed a norm.



Let $\mathcal{A}_S := \{a_S : a \in \mathcal{A}\}$.

Definition 5.1 We call $\mathcal{A}_S$ an allowed set, if

$$\mathcal{A}_S \subset \mathcal{A}.$$

Thus we use the same terminology for sets in $\mathbb{R}^p$ (such as
$\mathcal{A}_S$) and index sets $S$.

Lemma 5.1 Suppose $\mathcal{A}_S$ is an allowed set. Then $S$ is allowed, that
is if we take

$$\Omega^{S^c}(\beta_{S^c}) := \Omega(\beta_{S^c}; \mathcal{A}_{S^c}), \quad \beta_{S^c} \in \mathbb{R}^{p-|S|},$$

where $\mathcal{A}_{S^c} := \{a_{S^c} : a \in \mathcal{A}\}$, then the norm
$\Omega$ is weakly decomposable for $S$.

Note that $\mathcal{A}_{S^c}$ is a cone and that there always is an
$a_{S^c} \in \mathcal{A}_{S^c}$ which has all entries positive except for
those in $S$. Hence the restriction of $\Omega(\cdot; \mathcal{A}_{S^c})$ to
$\{\beta_{S^c} : \beta \in \mathbb{R}^p\}$ is a norm. We do not require
$\mathcal{A}_{S^c}$ to be an allowed set.

Example 5.1 As in Micchelli et al. [2010], consider the convex cone

$$\mathcal{A} := \{a_1 \ge a_2 \ge \cdots \ge a_p \ge 0\}.$$

The norm-penalty with norm $\Omega(\beta; \mathcal{A})$ then favors putting
the last indexes equal to zero. Moreover, for any $s$, the set of the first
$s$ indexes $\{1, \ldots, s\}$ is an allowed set. A partition
$\{G_t\}_{t=1}^T$ is called contiguous if for all $t = 1, \ldots, T-1$ and all
$j \in G_t$ and $k \in G_{t+1}$ it holds that $j < k$. In Micchelli et al.
[2010] it is shown that for all $\beta$ there is a unique contiguous partition
$\{G_t\}_{t=1}^T$ of $\{1, \ldots, p\}$ such that

$$\Omega(\beta; \mathcal{A}) = \sum_{t=1}^T \sqrt{|G_t|}\, \|\beta_{G_t}\|_2.$$

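We now return to the general norm $\Omega(\cdot; \mathcal{A})$. Its dual norm
is

$$\Omega_*(w; \mathcal{A}) = \max_{a \in \mathcal{A}(1)} \sqrt{\sum_{j=1}^p a_j w_j^2},$$

where $\mathcal{A}(1) := \{a \in \mathcal{A} : \|a\|_1 = 1\}$. A similar
expression holds for the dual norm $\Omega_*^{S^c}$ of $\Omega^{S^c}$.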



Maurer and Pontil [2012] provide moment inequalities for
$\Omega_*(\epsilon^T X; \mathcal{A})$. They show that when
$\epsilon \sim \mathcal{N}(0, I)$, then

$$E\,\Omega_*(\epsilon^T X; \mathcal{A})/n \le \lambda_\epsilon,$$

with



Xr 



^1^2 + Vlog lextreme points of A(l)\^j ^=^ Xf ^ . 



where $X_i = (x_{i,1}, \ldots, x_{i,p})$ is the $i$-th row of $X$. Using
concentration of measure (Talagrand [1995]), this can be turned into a
suitable probability inequality. Again, the results can be applied to
$\Omega_*^{S^c}$ as well.

The $\ell_1$-norm is a special case of the structured sparsity norm, with
$\mathcal{A} = [0, \infty)^p$.

The norm $\|\cdot\|_{2,1}$ corresponding to the group Lasso, as described in
Subsection 5.2, is also a special case, with

$$\mathcal{A} := \{a \in [0, \infty)^p : a \text{ is constant within groups}\}.$$
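
Both special cases can be verified numerically from the variational definition of the structured sparsity norm. The sketch below evaluates the objective $\frac12\sum_j(\beta_j^2/a_j + a_j)$ at the candidate minimizers $a_j = |\beta_j|$ (for the full nonnegative orthant) and $a_j = \|\beta_{G_t}\|_2/\sqrt{|G_t|}$ for $j \in G_t$ (for the group-constant cone). These candidates follow from the arithmetic-geometric mean inequality; the test vector, the groups and the function name are illustrative assumptions.

```python
import numpy as np

def objective(beta, a):
    """0.5 * sum_j (beta_j^2 / a_j + a_j), with the convention 0/0 = 0."""
    beta = np.asarray(beta, dtype=float)
    a = np.asarray(a, dtype=float)
    nz = beta != 0
    if np.any(a[nz] == 0):
        return np.inf                        # beta_j != 0 but a_j = 0
    ratio = np.zeros_like(beta)
    ratio[nz] = beta[nz] ** 2 / a[nz]
    return 0.5 * float(np.sum(ratio + a))

beta = np.array([1.5, -0.5, 0.0, 2.0])       # illustrative vector
groups = [[0, 1], [2, 3]]                    # illustrative partition

# Candidate minimizer over the cone [0, inf)^p.
a_l1 = np.abs(beta)
# Candidate minimizer over the cone of vectors constant within groups.
a_grp = np.empty_like(beta)
for g in groups:
    a_grp[g] = np.linalg.norm(beta[g]) / np.sqrt(len(g))

l1_value = objective(beta, a_l1)
grp_value = objective(beta, a_grp)
grp_norm = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)

print(l1_value, np.abs(beta).sum())          # should agree with the l1-norm
print(grp_value, grp_norm)                   # should agree with the group Lasso norm
```

Any other feasible $a$ in the respective cone gives an objective value at least as large, which is what the test of random perturbations below the block checks in a small simulation.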
5.4 A trivial example

A trivial example is the norm

$$\Omega_G(\beta) := \sqrt{|G|}\, \|\beta_G\|_2 + \|\beta_{G^c}\|_1,$$

which is a special case of the group Lasso norm, with $p - |G| + 1$ groups,
namely, the group $G$ and $p - |G|$ groups $\{j\}_{j \notin G}$, each
containing only one element. It is weakly decomposable for each
$S \supset G$ with $\Omega^{S^c} = \|\cdot\|_1$. We will invoke this example
mainly for facilitating our discussion of the relation between
$\Omega$-eigenvalues (see Section 6).



5.5 Overlapping groups 

In this example, we consider a norm corresponding to the group Lasso with
overlapping groups (Jacob et al. [2009]). Let $\{G_t\}_{t=1}^T$ be subsets of
$\{1, \ldots, p\}$, with $\cup_{t=1}^T G_t = \{1, \ldots, p\}$, and define

$$\Omega_{\rm overlap}(\beta) := \min\left\{ \sum_{t=1}^T \|b_t\|_2 : (b_t)_{G_t^c} = 0\ \forall\, t,\ \sum_{t=1}^T b_t = \beta \right\}.$$

The paper Jacob et al. [2009] shows that $\Omega_{\rm overlap}$ is indeed a
norm. However, as such $\Omega_{\rm overlap}$ is not weakly decomposable for
useful candidate sets $S$. On the other hand by a reparametrization with
parameters $\{b_t\}_{t=1}^T$, we can reformulate the overlapping group Lasso
problem into a group Lasso problem with non-overlapping groups. To see this,
note that

$$X\beta = \sum_{t=1}^T X b_t = \sum_{t=1}^T X_{G_t} b_{t,G_t}.$$

Thus, the overlapping group Lasso estimator is
$\hat\beta = \sum_{t=1}^T \hat b_t$, where

$$\{\hat b_t\}_{t=1}^T := \arg\min \left\{ \Big\|Y - \sum_{t=1}^T X_{G_t} b_{t,G_t}\Big\|_n^2 + 2\lambda \sum_{t=1}^T \|b_t\|_2 \right\}.$$

The augmented model has $\bar p := \sum_{t=1}^T |G_t|$ parameters
$\{b_{j,t} : j \in G_t\}_{t=1}^T$ and the augmented groups are
$\bar G_t := \{(j,t) : j \in G_t\}$ ($t = 1, \ldots, T$), which are by
definition non-overlapping. However, in the augmented design matrix

$$\bar X := \{X_{G_t}\}_{t=1}^T$$

the column $X_j$ appears $N_j := \sum_{t=1}^T 1\{j \in G_t\}$ times
($j = 1, \ldots, p$). Although such repetitions are not a problem for the
Lasso (see Remark 3.1), the implications for the group Lasso are not so clear.
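
The reparametrization can be sketched in a few lines: the augmented design is obtained by placing the column blocks of the groups side by side, and the original coefficient vector is recovered by summing the group-wise parameters. The groups, dimensions and function names below are illustrative assumptions (zero-based indices).

```python
import numpy as np

def augmented_design(X, groups):
    """Place the column blocks X_{G_t} side by side: (X_{G_1}, ..., X_{G_T})."""
    return np.hstack([X[:, g] for g in groups])

def multiplicities(p, groups):
    """N_j = number of groups that contain index j."""
    N = np.zeros(p, dtype=int)
    for g in groups:
        N[g] += 1
    return N

def beta_from_augmented(b_bar, p, groups):
    """Recover beta = sum_t b_t from the stacked group-wise parameters."""
    beta = np.zeros(p)
    start = 0
    for g in groups:
        beta[g] += b_bar[start:start + len(g)]
        start += len(g)
    return beta

rng = np.random.default_rng(4)
n, p = 8, 4
groups = [[0, 1], [1, 2], [2, 3]]            # overlapping groups covering {0,...,3}
X = rng.standard_normal((n, p))

X_bar = augmented_design(X, groups)          # n x (|G_1| + ... + |G_T|)
N = multiplicities(p, groups)
b_bar = rng.standard_normal(X_bar.shape[1])
beta = beta_from_augmented(b_bar, p, groups)

print(X_bar.shape, N)
print(np.allclose(X_bar @ b_bar, X @ beta))  # both parametrizations give the same fit
```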






6 Comparing $\Omega$-eigenvalues

The question arises to what extent using a norm-penalty with norm $\Omega$
different from the $\ell_1$-norm results in better oracle inequalities. This
partly depends on the behavior of the dual norm, a topic we briefly discuss in
Section 7. It also depends on the behavior of the $\Omega$-eigenvalues, which
is the theme of the present section.

Fix a set $S$ and consider again the norm $\Omega_S$ defined in Section 5.4:

$$\Omega_S(\beta) = \sqrt{|S|}\, \|\beta_S\|_2 + \|\beta_{S^c}\|_1.$$

This norm is decomposable for $S$ with $\Omega^{S^c} = \|\cdot\|_1$. The
$\Omega_S$-eigenvalue $\delta_{\Omega_S}(L,S)$ is the distance between the
contour of the ellipse $\{X\beta_S : \|\beta_S\|_2 = 1/\sqrt{|S|}\}$ and the
convex hull including interior $\{X\beta_{S^c} : \|\beta_{S^c}\|_1 \le L\}$.

Remark 6.1 In fact, $\delta_{\Omega_S}(L,S)$ is in part easy to compute: for
fixed $\beta_{S^c}$ one calculates

$$\min_{\|\beta_S\|_2^2 = 1/|S|} \|X\beta_S - X\beta_{S^c}\|_n^2 =: R^2(\beta_{S^c}).$$

This is a quadratic minimization problem with quadratic restriction, which can
be solved using Lagrange calculus. The more difficult part is to find the
minimizer of $R^2(\beta_{S^c})$ over all $\|\beta_{S^c}\|_1 \le L$.

In Bühlmann and van de Geer [2011], $|S| \times \delta_{\Omega_S}^2(L,S)$ is
called the adaptive restricted eigenvalue (because it occurred there in
conjunction with the adaptive Lasso).

Recall that $\delta(L,S)$ is the $\ell_1$-eigenvalue. Since
$\|\beta_S\|_1 \le \sqrt{|S|}\, \|\beta_S\|_2$ one easily checks that

$$\delta(L,S) \ge \delta_{\Omega_S}(L,S),$$

i.e., the $\ell_1$-eigenvalue $\delta(L,S)$ is better behaved than the
$\Omega_S$-eigenvalue $\delta_{\Omega_S}(L,S)$.



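Consider now the structured sparsity norm $\Omega(\cdot; \mathcal{A})$
introduced in Section 5.3. By Lemma 5.1 we know that under the condition that
$\mathcal{A}_S$ is allowed, the norm $\Omega(\cdot; \mathcal{A})$ is weakly
decomposable for $S$ with
$\Omega^{S^c}(\beta_{S^c}) = \Omega(\beta_{S^c}; \mathcal{A}_{S^c})$.

We note that

$$\Omega(\beta; \mathcal{A}) \ge \|\beta\|_1,$$

and if $1_S \in \mathcal{A}$, where $1$ is the constant vector
$1 := (1, \ldots, 1)$, then

$$\Omega(\beta_S; \mathcal{A}) \le \sqrt{|S|}\, \|\beta_S\|_2.$$

In other words, $\Omega(\cdot; \mathcal{A})$ intermediates the $\ell_1$- and
$\ell_2$-norm.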

Lemma 6.1 Suppose $\mathcal{A}_S$ is allowed and that $1_S \in \mathcal{A}$,
where $1$ is the constant vector $1 := (1, \ldots, 1)$. Then for all $L > 0$,

$$\delta_\Omega(L,S) \ge \delta_{\Omega_S}(L,S).$$






It follows that

$$\Gamma^2(L,S) \le \Gamma_{\Omega_S}^2(L,S),$$

and more generally, under the conditions of Lemma 6.1,

$$\Gamma_\Omega^2(L,S) \le \Gamma_{\Omega_S}^2(L,S).$$

The $\Omega$-effective sparsity $\Gamma_\Omega^2(L,S)$ is in general not
comparable to the $\|\cdot\|_1$-effective sparsity $\Gamma^2(L,S)$ for the
$\ell_1$-norm $\|\cdot\|_1$. This is only partly due to the fact that the cone
condition for $\Omega$ and the cone condition for $\|\cdot\|_1$ are not
comparable. We finally note that the restricted eigenvalue (see Bickel et al.
[2009]) is in between $|S|\delta_{\Omega_S}^2(L,S)$ and $|S|\delta^2(L,S)$,
and that the $\Omega$-eigenvalue $\delta_\Omega(L,S)$ is not comparable to the
restricted eigenvalue either, which is now solely due to the incomparability
of the cone conditions.



7 Discussion 



We have shown that sparsity oracle properties hold for the least squares estima- 
tor with separable norm-penalty. There are a few issues that can be addressed 
here. 

First of all, the choice of a norm other than || • ||i can be inspired by the practical 
use: the estimator may have a better interpretation. On the other hand, it may 
be harder to compute. 

The second point is that with another norm, the dual norm may be better
behaved than with the $\ell_1$-norm. This is the case for instance for the
group Lasso, which wins in certain cases from the Lasso by a $\log p$-term. In
this paper, we have not discussed in detail the properties of the dual
$\Omega_*((\epsilon^T X)_S)$ or $\Omega_*^{S^c}((\epsilon^T X)_{S^c})$ to
avoid digressions. General results can be found in Maurer and Pontil [2012].



Larger norms have smaller dual norms, that is if
$\bar\Omega(\beta) \ge \Omega(\beta)$ for all $\beta$, then
$\bar\Omega_*(w) \le \Omega_*(w)$ for all $w$. Note that Theorem 4.1, applied
with the larger norm $\bar\Omega$, gives bounds for the $\bar\Omega$-error of
$\hat\beta_S$, so not only is its dual norm smaller than that of $\Omega$, but
the bound also holds for the $\bar\Omega$-error. In particular, this
comparison can be made between the structured sparsity norm
$\Omega(\cdot; \mathcal{A})$ defined in Section 5.3 and the $\ell_1$-norm,
because $\Omega(\beta; \mathcal{A}) \ge \|\beta\|_1$ for all $\beta$. Note
further that Theorem 4.1 also involves $\Omega^{S^c}$ and its dual
$\Omega_*^{S^c}$, and that its result can be optimized by taking the largest
possible choice for $\Omega^{S^c}$ (which will then also optimize the
$\Omega$-eigenvalue).

Of course, the price to pay for using a norm different from $\ell_1$ is that
it may only be weakly decomposable for relatively large sets $S$. That is, one
should choose a norm that corresponds to a priori knowledge on the sparsity
structure.

It is to be noted further that by invoking the dual norm equality one might
not exploit in full the structure of the problem. More refined techniques are
given in, for example, van de Geer and Lederer [2012].



In cases where the penalty involves a "smoothness" norm (for example a Sobolev
norm), the philosophy is again different. In the classical setup, such a
penalty is invoked for establishing (non-adaptive) smoothness only. In more
recent settings, the aim is to obtain both sparsity and smoothness. An
example, concerning the high-dimensional additive model, is in Meier et al.
[2009]. There, the issue of decomposability comes up as well. Oracle results
are derived using a penalty that is not only sparsity decomposable but also
"smoothness" decomposable (see also Bühlmann and van de Geer [2011], Section
8.4.5).



Finally, the oracle results can be extended to loss functions other than least
squares (for example in the spirit of van de Geer [2008] or Negahban et al.
[2012]). Sharp oracle results are discussed in van de Geer [2013]. For the
quasi-likelihood loss with canonical link function, the dual-norm argument can
again be used. For other cases this argument generally has to be replaced.
Here, tools from empirical process theory can be invoked (such as those
outlined in Bühlmann and van de Geer [2011], Chapter 8).



8 Proofs 

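Proof of Lemma 4.1. Let
$\beta \in \mathcal{C} := \{\Omega^{S^c}(\beta_{S^c}) \le L\,\Omega(\beta_S) \ne 0\}$.
Write

$$b := \beta / \Omega(\beta_S).$$

Then $\Omega(b_S) = 1$ and $\Omega^{S^c}(b_{S^c}) \le L$, and hence, since the
constraint on $b_{S^c}$ is invariant under a change of sign,

$$\frac{\|X\beta\|_n}{\Omega(\beta_S)} = \|Xb_S + Xb_{S^c}\|_n \ge \delta_\Omega(L,S).$$

It follows that

$$\min_{\beta \in \mathcal{C}} \frac{\|X\beta_S + X\beta_{S^c}\|_n}{\Omega(\beta_S)} = \delta_\Omega(L,S).$$

□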



The next lemma shows why convexity of the penalty is important. The result
can be extended to loss functions other than quadratic loss, see the rejoinder
in the discussion paper van de Geer [2013].



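Lemma 8.1 Let $\mathcal{B}$ be a convex subset of $\mathbb{R}^p$ and
${\rm pen} : \mathcal{B} \to \mathbb{R}$ be a convex penalty. Let moreover

$$\hat\beta := \arg\min_{\beta \in \mathcal{B}} \left\{ \|Y - X\beta\|_n^2 + 2\,{\rm pen}(\beta) \right\}.$$

Then for every $\beta \in \mathcal{B}$

$$(Y - X\hat\beta)^T X(\beta - \hat\beta)/n + {\rm pen}(\hat\beta) \le {\rm pen}(\beta).$$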

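Proof. Fix $\beta \in \mathcal{B}$ and define for $0 < \alpha \le 1$,

$$\hat\beta_\alpha := (1 - \alpha)\hat\beta + \alpha\beta.$$

We have

$$\|Y - X\hat\beta\|_n^2 + 2\,{\rm pen}(\hat\beta) \le \|Y - X\hat\beta_\alpha\|_n^2 + 2\,{\rm pen}(\hat\beta_\alpha)$$

$$\le \|Y - X\hat\beta_\alpha\|_n^2 + 2(1 - \alpha)\,{\rm pen}(\hat\beta) + 2\alpha\,{\rm pen}(\beta),$$

where we used the convexity of the penalty. It follows that

$$\frac{\|Y - X\hat\beta\|_n^2 - \|Y - X\hat\beta_\alpha\|_n^2}{\alpha} + 2\,{\rm pen}(\hat\beta) \le 2\,{\rm pen}(\beta).$$

But clearly

$$\lim_{\alpha \downarrow 0} \frac{\|Y - X\hat\beta\|_n^2 - \|Y - X\hat\beta_\alpha\|_n^2}{\alpha} = 2(Y - X\hat\beta)^T X(\beta - \hat\beta)/n.$$

□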
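
The inequality of Lemma 8.1 can be checked numerically in a case where the penalized estimator is known in closed form. The sketch below assumes the $\ell_1$-penalty ${\rm pen}(\beta) = \lambda\|\beta\|_1$ and an orthonormal design ($X^T X/n = I$), for which the minimizer is the soft-thresholded vector $X^T Y/n$; the simulated data and names are illustrative assumptions, and the check over random comparison vectors is a sanity test, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 40, 6, 0.5

# Orthonormal design (assumption of this sketch): X^T X / n = I.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
Y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.3, 0.0]) + rng.standard_normal(n)

# Closed-form Lasso solution under orthonormal design: soft-thresholding.
z = X.T @ Y / n
beta_hat = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def two_point_gap(beta):
    """pen(beta) - [(Y - X beta_hat)^T X (beta - beta_hat)/n + pen(beta_hat)]."""
    lhs = (Y - X @ beta_hat) @ X @ (beta - beta_hat) / n
    lhs += lam * np.abs(beta_hat).sum()
    return lam * np.abs(beta).sum() - lhs

# The lemma states that the gap is nonnegative for every comparison vector.
gaps = [two_point_gap(scale * rng.standard_normal(p))
        for scale in (0.1, 1.0, 10.0) for _ in range(100)]
print(min(gaps) >= -1e-9)
```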

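Proof of Theorem 4.1. Let us write for $v, w \in \mathbb{R}^n$,

$$(v, w)_n := v^T w / n.$$

Fix some $\beta \in \mathbb{R}^p$ and let $S \supset \{j : \beta_j \ne 0\}$ be
an allowed set. If

$$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n \le -\big( \delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta) + \delta(\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) \big)$$

we find

$$\|X(\hat\beta - \beta^0)\|_n^2 + \delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta) + \delta(\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) - \|X(\beta - \beta^0)\|_n^2$$

$$= \delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta) + \delta(\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) - \|X(\hat\beta - \beta)\|_n^2 + 2(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n \le 0.$$

Hence, then we are done.

Suppose now that

$$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n \ge -\big( \delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta) + \delta(\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) \big).$$

By Lemma 8.1 we have

$$(Y - X\hat\beta, X(\beta - \hat\beta))_n + \lambda\Omega(\hat\beta) \le \lambda\Omega(\beta),$$

or

$$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n + \lambda\Omega(\hat\beta) \le (\epsilon, X(\hat\beta - \beta))_n + \lambda\Omega(\beta).$$

By definition of the dual norm,

$$(\epsilon, X(\hat\beta - \beta))_n = (\epsilon, X(\hat\beta_S - \beta))_n + (\epsilon, X\hat\beta_{S^c})_n \le \lambda^S\Omega(\hat\beta_S - \beta) + \lambda^{S^c}\Omega^{S^c}(\hat\beta_{S^c}).$$

Thus

$$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n + \lambda\Omega(\hat\beta) \le \lambda^S\Omega(\hat\beta_S - \beta) + \lambda^{S^c}\Omega^{S^c}(\hat\beta_{S^c}) + \lambda\Omega(\beta).$$

By the weak decomposability of $\Omega$ and the triangle inequality, this
implies

$$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n + (\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) \le (\lambda + \lambda^S)\Omega(\hat\beta_S - \beta). \qquad (1)$$

Since
$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n \ge -\big( \delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta) + \delta(\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) \big)$
this gives

$$\Omega^{S^c}(\hat\beta_{S^c}) \le L_S\, \Omega(\hat\beta_S - \beta).$$

We now insert Lemma 4.1 which gives

$$\Omega(\hat\beta_S - \beta) \le \Gamma_\Omega(L_S, S)\, \|X(\hat\beta - \beta)\|_n$$

and continue with inequality (1):

$$(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n + (\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) + \delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta)$$

$$\le \big[(1 + \delta)(\lambda + \lambda^S)\big]\, \Gamma_\Omega(L_S, S)\, \|X(\hat\beta - \beta)\|_n$$

$$\le \frac{1}{2}\big[(1 + \delta)(\lambda + \lambda^S)\big]^2\, \Gamma_\Omega^2(L_S, S) + \frac{1}{2}\|X(\hat\beta - \beta)\|_n^2. \qquad (2)$$

Since

$$2(X(\hat\beta - \beta^0), X(\hat\beta - \beta))_n = \|X(\hat\beta - \beta^0)\|_n^2 - \|X(\beta - \beta^0)\|_n^2 + \|X(\hat\beta - \beta)\|_n^2$$

we obtain

$$\|X(\hat\beta - \beta^0)\|_n^2 + 2(\lambda - \lambda^{S^c})\Omega^{S^c}(\hat\beta_{S^c}) + 2\delta(\lambda + \lambda^S)\Omega(\hat\beta_S - \beta)$$

$$\le \|X(\beta - \beta^0)\|_n^2 + \big[(1 + \delta)(\lambda + \lambda^S)\big]^2\, \Gamma_\Omega^2(L_S, S).$$

□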



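Proof of Lemma 5.1. Note that for any $a$ and $\beta$

$$\frac{1}{2}\sum_{j=1}^p \left( \frac{\beta_j^2}{a_j} + a_j \right) = \frac{1}{2}\sum_{j \in S} \left( \frac{\beta_j^2}{a_j} + a_j \right) + \frac{1}{2}\sum_{j \in S^c} \left( \frac{\beta_j^2}{a_j} + a_j \right).$$

Hence, writing

$$a(\beta) := \arg\min_{a \in \mathcal{A}} \frac{1}{2}\sum_{j=1}^p \left( \frac{\beta_j^2}{a_j} + a_j \right),$$

we have

$$\Omega(\beta; \mathcal{A}) = \frac{1}{2}\sum_{j \in S} \left( \frac{\beta_j^2}{a_j(\beta)} + a_j(\beta) \right) + \frac{1}{2}\sum_{j \in S^c} \left( \frac{\beta_j^2}{a_j(\beta)} + a_j(\beta) \right)$$

$$\ge \min_{a_S \in \mathcal{A}_S} \frac{1}{2}\sum_{j \in S} \left( \frac{\beta_j^2}{a_{j,S}} + a_{j,S} \right) + \min_{a_{S^c} \in \mathcal{A}_{S^c}} \frac{1}{2}\sum_{j \in S^c} \left( \frac{\beta_j^2}{a_{j,S^c}} + a_{j,S^c} \right)$$

$$\ge \Omega(\beta_S; \mathcal{A}) + \Omega(\beta_{S^c}; \mathcal{A}_{S^c}),$$

where the last step uses $\mathcal{A}_S \subset \mathcal{A}$.

□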

Proof of Lemma 6.1. Suppose $\beta$ satisfies the $(L,S)$-cone condition for
$\Omega$: then also

$$\Omega_S^{S^c}(\beta_{S^c}) = \|\beta_{S^c}\|_1 \le \Omega^{S^c}(\beta_{S^c}) \le L\,\Omega(\beta_S) \le L\sqrt{|S|}\, \|\beta_S\|_2 = L\,\Omega_S(\beta_S),$$

where in the last inequality we used $1_S \in \mathcal{A}$. Hence, $\beta$
satisfies the $(L,S)$-cone condition for $\Omega_S$. But then

$$\Omega(\beta_S) \le \Omega_S(\beta_S) \le \frac{\|X\beta\|_n}{\delta_{\Omega_S}(L,S)}.$$

The result now follows from Lemma 4.1.

□